Day 6: What Are LLMs? The Principles of Large Language Models
An LLM (Large Language Model) is not simply a “big model.” Once it surpasses a certain scale, abilities that were impossible in smaller models suddenly emerge. This is what makes LLMs special.
The Core of LLMs: Next Token Prediction
The training objective of LLMs is surprisingly simple. Predicting the next token is all there is to it.
```python
import math

# Pre-training of LLMs boils down to this:
# Input:  "The weather today is really"
# Target: "nice"
# By repeating this task over trillions of tokens, the model learns
# grammar, facts, and even reasoning abilities.

def cross_entropy(predicted, target):
    """Negative log-probability assigned to the true token.
    Here `predicted` is a dict mapping token -> probability,
    a stand-in for the model's softmax output."""
    return -math.log(predicted.get(target, 1e-12))

def next_token_prediction_loss(model, text_tokens):
    """The core loss function of pre-training."""
    total_loss = 0.0
    for i in range(1, len(text_tokens)):
        context = text_tokens[:i]   # Previous tokens
        target = text_tokens[i]     # Next token (ground truth)
        predicted = model(context)  # Model's predicted distribution
        total_loss += cross_entropy(predicted, target)
    return total_loss / (len(text_tokens) - 1)
```
Scaling Laws
Optimal training rules revealed by the Chinchilla paper (2022):
| Parameters | Optimal Tokens | Example Model |
|---|---|---|
| 1B | 20B tokens | TinyLlama class |
| 7B | 140B tokens | Llama 2 7B |
| 13B | 260B tokens | Llama 2 13B |
| 70B | 1.4T tokens | Llama 2 70B |
```python
# Chinchilla rule of thumb: optimal tokens ~ 20 x parameter count
def optimal_tokens(num_params_billions):
    return num_params_billions * 20  # Unit: billions (B) of tokens

# Example
for size in [1, 7, 13, 70]:
    tokens = optimal_tokens(size)
    print(f"{size}B model -> optimal {tokens}B tokens for training")
```
However, recent models often exceed this rule. Llama 3 8B was trained on 15T tokens, roughly 100 times the Chinchilla optimal.
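The gap can be made concrete with a quick back-of-the-envelope calculation, using the figures above (8B parameters, 15T training tokens):

```python
# Rough sketch: how far Llama 3 8B exceeds its Chinchilla-optimal budget.
# Figures come from the text above; all token counts are in billions.
params_b = 8                           # parameters, in billions
actual_tokens_b = 15_000               # 15T tokens = 15,000B
chinchilla_optimal_b = params_b * 20   # Chinchilla rule: ~20 tokens per parameter
ratio = actual_tokens_b / chinchilla_optimal_b
print(f"Chinchilla optimum: {chinchilla_optimal_b}B tokens")
print(f"Actual: {actual_tokens_b}B tokens -> {ratio:.0f}x the optimum")
```

The extra training past the Chinchilla optimum trades compute efficiency for a smaller, cheaper-to-serve model at a given quality level.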
Characteristics by Parameter Scale
```python
model_capabilities = {
    "Under 1B": {
        "capable": ["Simple classification", "Sentiment analysis", "Keyword extraction"],
        "difficult": ["Complex reasoning", "Long text generation", "Code generation"],
        "examples": "DistilBERT, TinyLlama",
    },
    "7B - 13B": {
        "capable": ["General conversation", "Basic coding", "Summarization", "Translation"],
        "difficult": ["Mathematical reasoning", "Complex analysis"],
        "examples": "Llama 3 8B, Mistral 7B",
    },
    "30B - 70B": {
        "capable": ["Complex reasoning", "Advanced coding", "Long document analysis"],
        "difficult": ["Top-tier mathematics", "Specialized medical/legal"],
        "examples": "Llama 3 70B, Mixtral 8x7B",
    },
    "Over 100B": {
        "capable": ["Advanced reasoning", "Multi-turn conversation", "Creative writing"],
        "note": "Emergent abilities fully manifest",
        "examples": "Latest top-tier GPT models, latest Claude Opus",
    },
}

for size, info in model_capabilities.items():
    print(f"\n{'=' * 40}")
    print(f"Scale: {size}")
    print(f"Capable of: {', '.join(info['capable'])}")
    print(f"Examples: {info['examples']}")
```
Emergent Abilities
```python
# Emergent abilities: capabilities that appear abruptly once model size
# crosses a threshold -- success rate is near 0% in small models,
# then spikes at a certain scale.
emergent_abilities = {
    "Chain-of-Thought reasoning": "Reaching correct answers through step-by-step thought processes",
    "Few-shot learning": "Performing new tasks after seeing just a few examples",
    "Code generation": "Converting natural language descriptions into code",
    "Multilingual translation": "Translating language pairs not seen during training",
    "Arithmetic reasoning": "Solving multi-step math problems",
}

print("Emergent abilities of LLMs:")
for ability, description in emergent_abilities.items():
    print(f"  - {ability}: {description}")

# Note: recent research argues that "emergent abilities" may be an
# artifact of the measurement method: the sharp transition disappears
# when evaluation metrics are made continuous.
```
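One way to build intuition for the measurement-artifact argument is a toy calculation (my own sketch, not a result from the research itself): if per-token accuracy improves smoothly with scale, a metric that demands an exact match over a multi-token answer can still look like a sudden jump.

```python
# Toy sketch (assumptions: smooth per-token accuracy p, tokens scored
# independently). Exact-match on a 10-token answer is p**10: it stays
# near 0 until p is already high, then "emerges" -- even though the
# underlying per-token accuracy grows gradually.
k = 10  # answer length in tokens (hypothetical)
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-token accuracy {p:.2f} -> exact-match {p ** k:.3f}")
```

Under this view, the continuous metric (per-token accuracy) shows no discontinuity; only the all-or-nothing metric does.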
The power of LLMs comes from performing a simple objective (next token prediction) at an extreme scale. Starting tomorrow, we’ll examine actual LLM models one by one.
Today’s Exercises
- Intuitively explain how the ability to answer questions can emerge from “just predicting the next token.” (Hint: the training data includes Q&A-format text)
- According to the Chinchilla law, calculate the optimal number of training tokens for a 3B parameter model, and research how many tokens Phi-3 (3.8B) was actually trained on.
- Research the argument that emergent abilities are “measurement artifacts” and form your own opinion.